What Makes for an Effective Data Practitioner in 2024


Marck Vaisman

Senior Technical Specialist, Microsoft
marck.vaisman@microsoft.com

Adjunct Professor, Data Science, Georgetown University
marck.vaisman@georgetown.edu

These are my personal opinions and do not represent any organization

Inspiration

In 2010, Drew Conway gave us this: our somewhat reductionist definition of a data unicorn

The OG Data Science Cheat Sheet

Analyzing the Analyzers, 2012-2013

(Harris, Murphy, and Vaisman 2013)

Skills distilled from our survey (in alphabetical order)

  • Algorithms (ex: computational complexity, CS theory)
  • Back-End Programming (ex: JAVA/Rails/Objective C)
  • Bayesian/Monte-Carlo Statistics (ex: MCMC, BUGS)
  • Big and Distributed Data (ex: Hadoop, Map/Reduce)
  • Business (ex: management, business development, budgeting)
  • Classical Statistics (ex: general linear model, ANOVA)
  • Data Manipulation (ex: regexes, R, SAS, web scraping)
  • Front-End Programming (ex: JavaScript, HTML, CSS)
  • Graphical Models (ex: social networks, Bayes networks)
  • Machine Learning (ex: decision trees, neural nets, SVM, clustering)
  • Math (ex: linear algebra, real analysis, calculus)

  • Optimization (ex: linear, integer, convex, global)
  • Product Development (ex: design, project management)
  • Science (ex: experimental design, technical writing/publishing)
  • Simulation (ex: discrete, agent-based, continuous)
  • Spatial Statistics (ex: geographic covariates, GIS)
  • Structured Data (ex: SQL, JSON, XML)
  • Surveys and Marketing (ex: multinomial modeling)
  • Systems Administration (ex: *nix, DBA, cloud tech.)
  • Temporal Statistics (ex: forecasting, time-series analysis)
  • Unstructured Data (ex: noSQL, text mining)
  • Visualization (ex: statistical graphics, mapping, web-based dataviz)

Evolution of Data Science and in-demand skills

In 2024, the hard truth: an overloaded definition and set of expectations leading to the Data Practitioner Soup

SQL, Python, and a dash of dark magic

How it started, how it’s going

~2012

2024

Expectations vs. Reality

The complexities of defining data science (and I have no intention of doing so)

Data Science can be considered a:

  • Science: Data science is viewed as a continuation of empirical science, which has always been centered around data, with historical examples like Kepler’s use of data to prove Copernicus’ theory.

  • Research Paradigm: It represents a shift in research methodology, moving from a deductive approach to a more inductive approach due to the abundance of data and computational resources.

  • Research Method: Data science is used to discover new concepts, measure their prevalence, assess causal effects, and make predictions, transforming the research process.

  • Discipline: The field is inherently interdisciplinary, integrating knowledge from domains such as computer science, mathematics, statistics, and specific application domains.

  • Workflow: It involves a series of steps including data collection, exploratory data analysis, modeling, and communication of results.

  • Profession: The multifaceted nature of data science work includes aspects of science, engineering, mathematics, statistics, and domain expertise, making it a distinct professional field.

(Hazzan and Mike 2023)

Mastering the broadness is extremely difficult and needs time

And teaching is even harder!

  • Broad Discipline
  • Complex Topics
  • Variety of Thinking Skills
  • Special Professional and Organizational Skills
  • Educators’ Background

(Hazzan and Mike 2023)

WWDSD?

What would a Data Scientist do?

The work we did in 2013 has drawn over 146 citations in the last 10 years

Prior work on academic models for skills and competencies for Data Science

  1. EDISON Data Science Framework (EDSF): This framework includes a Competency Framework for Data Science (CF-DS) and a defined body of knowledge (DS-BoK). The CF-DS is developed around five major knowledge area groups: Data Analytics, Data Science Engineering, Data Management, Research Methods and Project Management, and Domain Related Competencies and Business Analytics Competencies. These areas define the explicit skills and knowledge that exemplify competence in data science.

  2. AIS IS 2010 Curriculum Guidelines: This curriculum is designed to educate and prepare graduates to enter the workforce by equipping them with knowledge and skills in three categories: IS-specific knowledge and skills, foundational knowledge and skills, and domain fundamentals. The “IS 2010” report is a collaborative effort between the Association for Computing Machinery (ACM) and the Association for Information Systems (AIS).

  3. Business Higher Education Forum (BHEF) Data Science and Analytics Competency Map: This map lists Data Science concepts and principles, tiered by when and where these concepts are learned.

  4. ACM and IADSS Competencies: These documents provide a high-level list of competencies that undergraduate Data Science students should learn, with competencies directly comparable to EDISON’s CF-DS.

  5. Park City Math Institute Curriculum Guidelines for Undergraduate Programs in Data Science: This guideline does not elaborate on how to integrate application-domain knowledge into the curriculum, but it recognizes the importance of domain-related knowledge for the practical work of a Data Scientist.

(Schmitt et al. 2023; Weiser et al. 2022; Hazzan and Mike 2023)

We need to be careful, though

Skills and Competencies

The difference between a skill and a competency is often related to the scope and integration of knowledge, abilities, and behaviors. A skill is typically understood as a specific learned activity that can be performed, often something that can be developed through practice. Competencies, on the other hand, are broader and include a combination of skills, knowledge, and attributes that enable someone to perform effectively in a job or situation.

Competencies are often described as more holistic, encompassing not just the ability to perform a task (skill) but also the understanding (knowledge) and the appropriate application (attributes) of that skill in various contexts. They reflect a person’s capability to apply or use a set of related knowledge, skills, and abilities required to successfully perform “critical work functions” or tasks in a defined work setting.

In summary, while skills are specific to certain tasks, competencies are more comprehensive and relate to the overall ability to perform a job effectively, which includes a combination of multiple skills, the knowledge of when and how to use them, and the attitude or behavior to perform them successfully.

(Weiser et al. 2022)

Thinking models

Computational

  • Cognitive Ability: It is a problem-solving process that involves designing solutions to be implemented by a person, a computer, or both.
  • Independent of Technology: Its implementation is not dependent on technology.
  • Key Skills and Processes: These include problem formulation, dividing a problem into sub-problems, organizing and logically analyzing data, representing data with models and simulations using abstraction, suggesting and assessing solutions, examining and implementing the chosen solution, and generalizing the solution to a range of problems.
  • Social Skills: Teamwork, time management, and planning and scheduling tasks are also part of computational thinking.
  • Broad Knowledge: Emphasizes the acquisition of multidisciplinary knowledge and skills that can be applied in various contexts.

Statistical

  • Understanding of Data: Involves an understanding of the essence, characteristics, and variability of real-life data.
  • Statistical Inquiry: Covers the entire process of statistical inquiry, from data collection to analysis and interpretation.
  • Use of Statistical Methods: Addresses when and how to use specific statistical data analysis methods.
  • Sampling and Inference: Refers to the nature of sampling and how to infer from samples to populations.
  • Statistical Models: Includes the use of statistical models and their application.
  • Contextual Analysis: Considers the context of a given problem when performing investigations and drawing conclusions.

Mathematical

  • Mathematical Foundations: A core competence for data science graduates, involving a deep understanding of mathematical principles.
  • Model Building and Assessment: Skills in constructing and evaluating mathematical models relevant to data science.

Application Domain

  • Domain-Specific Knowledge: Understanding the core principles and ethical considerations within a specific application domain.
  • Data Curation: Involves managing and organizing data relevant to the domain.
  • Knowledge Transference: Communication and responsibility in transferring domain-specific insights.

Data

  • Analytical Thinking: Combines computational and statistical thinking to analyze data effectively.
  • Algorithms and Software Foundation: Knowledge of algorithms and software relevant to data science.
  • Data Curation Skills: Involves the ability to curate and manage data efficiently.
  • Communication and Responsibility: Skills in communicating findings and understanding the implications of data analysis.

(Hazzan and Mike 2023; Adhikari and Jordan 2021)

Proposed skills model

Call to action

What are you going to do to help fix this?

  1. Standardize Roles: Adopt a simplified, anchor-based categorization of job roles with clear definitions and expectations to reduce confusion and align understanding across the industry.

  2. Utilize Frameworks: Refer to established frameworks such as the IADSS Data Science Knowledge Framework to converge on the body of knowledge specification and ensure consistency in role definitions.

  3. Skills-Based Assessment: Implement objective, skills-based assessments to standardize the evaluation of data science professionals and ensure alignment with role requirements.

  4. Industry-Specific Classifications: Extend the work to include industry-specific role classifications that cater to the unique needs and contexts of different sectors.

  5. Benchmarking and Comparison: Compare and contrast standardized role definitions with actual industry practices, such as analyzing LinkedIn job posts, to ensure relevance and applicability.

  6. Communication and Education: Communicate the process and expectations clearly to all stakeholders, including hiring managers, executives, educators, and aspiring data science professionals, to align expectations.

  7. Continuous Evolution: Recognize that the field is dynamic, with new titles and roles emerging, and be open to further subclassification and evolution of data scientist roles.

(Fayyad and Hamutcu 2022)

References

Adhikari, Ani, and Michael I. Jordan. 2021. “Interleaving Computational and Inferential Thinking in an Undergraduate Data Science Curriculum.” Harvard Data Science Review, March. https://doi.org/10.1162/99608f92.cb0fa8d2.
Fayyad, Usama, and Hamit Hamutcu. 2022. “From Unicorn Data Scientist to Key Roles in Data Science: Standardizing Roles.” Harvard Data Science Review, July. https://doi.org/10.1162/99608f92.008b5006.
Harris, Harlan D., Sean Patrick Murphy, and Marck Vaisman. 2013. Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work. First edition. Beijing: O’Reilly.
Hazzan, Orit, and Koby Mike. 2023. Guide to Teaching Data Science: An Interdisciplinary Approach. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-031-24758-3.
Schmitt, Karl R. B., Linda Clark, Katherine M. Kinnaird, Ruth E. H. Wertz, and Björn Sandstede. 2023. “Evaluation of EDISON’s Data Science Competency Framework Through a Comparative Literature Analysis.” FoDS 5 (2): 177–98. https://doi.org/10.3934/fods.2021031.
Weiser, Orli, Yoram M. Kalman, Carmel Kent, and Gilad Ravid. 2022. “65 Competencies: Which Ones Should Your Data Analytics Experts Have?” Commun. ACM 65 (3): 58–66. https://doi.org/10.1145/3467018.

Technical

  • Images generated with the DALL-E 3 model on Azure OpenAI
  • Content summarization performed by:
    • Chunking PDFs
    • Embedding with text-embedding-ada-002
    • Using Azure AI hybrid search
    • Summarizing with GPT-4
  • Presentation written in Quarto, rendered as reveal.js, and presented in the browser
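
The chunking and hybrid-search steps above can be sketched in a few lines. This is a minimal illustration and not the code used for this deck: the function names `chunk_text` and `reciprocal_rank_fusion`, the word-based chunk sizes, and the choice of reciprocal rank fusion to merge keyword and vector rankings are all assumptions made here. The embedding and summarization calls are left out because they depend on a hosted service.

```python
from collections import defaultdict


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words.

    Hypothetical helper: sizes are counted in words, not model tokens.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one ranking.

    Each id scores sum(1 / (k + rank)) over the lists it appears in.
    Assumption: one list comes from keyword search, one from vector search.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


if __name__ == "__main__":
    pieces = chunk_text("a b c d e f g h i j", chunk_size=4, overlap=1)
    print(pieces)
    keyword_hits = ["a", "b", "c"]   # hypothetical keyword-search ranking
    vector_hits = ["b", "c", "a"]    # hypothetical vector-search ranking
    print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from either side, at the cost of some duplicated text in the index.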

Let’s talk!

marck.vaisman@microsoft.com

marck.vaisman@georgetown.edu

https://wahalulu.github.io/data-council-2024-effective-data-practitioner/